Abstract 


We present an improved analysis of mini-batched stochastic dual coordinate ascent 
for regularized empirical loss minimization (i.e. SVM and SVM-type objectives). 
Our analysis allows for flexible sampling schemes, including where data is 
distributed across machines, and combines a dependence on the smoothness of the 
loss and/or the data spread (measured through the spectral norm). 


1 Introduction 

Stochastic optimization approaches have significant theoretical and empirical advantages in training 
linear Support Vector Machines (SVMs) and other regularized loss minimization problems, and are 
often the methods of choice in practice. Such methods use a single, randomly chosen, training 
example at each iteration. In the context of SVMs, many variations of stochastic gradient descent 
(SGD) have been suggested, based on primal stochastic gradients (e.g. Pegasos lf25l . NORMA 1301 , 
SAG (221, MISO 02), S2GD (TT), mS2GD HO) and Prox-SVRG SEED)- In this paper we focus on 
SDCA—stochastic dual coordinate ascent—which is based on improvements to the dual problem, 
again considering only a single randomly chosen training example, and thus only a single randomly 
chosen dual variable, at each iteration Emma. Especially when accurate solutions are desired, 
SDCA has better complexity guarantees and often performs better in practice than SGD 17112311. 

The inherent sequential nature of such approaches becomes a problematic limitation in parallel and 
distributed settings as the predictor must be updated after each training point is processed, providing 
very little opportunity for parallelization. A popular remedy is to use mini-batches: the use of several 
training points at each iteration, calculating the update based on each point separately and aggregating 
the updates. The question is then whether basing each iteration on several points can indeed 
reduce the number of required iterations, and thus yield parallelization speedups. 

For SGD with a non-smooth loss, mini-batching does not reduce the number of worst-case required 
iterations and thus does not allow parallel speedups in the worst case. However, when the loss 
function is smooth, mini-batching can be beneficial and linear speedups can be obtained, even when 
the mini-batch sizes scales polynomially with the total training set size mm®. Alternatively, even 
for non-smooth loss, linear speedups can also be ensured if the data is reasonably well-spread, as 
measured by the spectral norm of the data, as long as the mini-batch size is not larger than the inverse 
of this spectral norm [27]. 

For SDCA, using a mini-batch corresponds to updating multiple coordinates concurrently and 
independently. If appropriate care is taken with the updates (see Section 6), then using a mini-batch 
size as large as the inverse spectral norm leads to a reduction in the number of iterations, and allows 
linear parallelization speedups, even when the loss is non-smooth [2, 27]. This parallels the SGD 
mini-batch analysis for non-smooth loss. But can mini-batching also be beneficial for SDCA with 
smooth losses and without data-spread (spectral norm) assumptions, as with SGD? What mini-batch 
sizes allow for parallel speedups? In this paper we answer these questions and show that, as 
with SGD, when the loss function is smooth, using mini-batches with SDCA yields a linear reduction 
in the number of iterations and thus allows for linear parallelization speedups, up to similar 
polynomial limits on the mini-batch size. Furthermore, we provide an analysis that combines the 
benefits of smoothness with the data-dependent benefits of a low spectral norm, and thus allows for 
even larger mini-batch sizes when the loss is smooth and the data is well-spread. 

Another issue that we address is the way mini-batches are sampled. Straightforward mini-batch 
analysis, including the previous analysis of mini-batch SDCA [27], assumes that at each iteration we 
pick a mini-batch of size b uniformly at random from among all subsets of b training examples. In 
practice, though, data is often partitioned between C < b machines, and at each iteration b/C points 
are sampled from each machine, yielding a mini-batch that is not uniformly distributed among all 
possible subsets (e.g. we have zero probability of using b points from the same machine as a 
mini-batch). Other architectural restrictions might lead to different sampling schemes. The analysis we 
present can be easily applied to different sampling schemes, and in particular we consider distributed 
sampling as described above and show that essentially the same guarantees (with minor modification) 
hold also for this more realistic sampling scheme. 

Finally, we compare our optimization guarantees to those recently established for CoCoA+ [14]. 
CoCoA+ is an alternative dual-based distributed optimization approach, which can be viewed as 
including mini-batch SDCA as a special case, and going beyond SDCA to potentially more powerful 
optimization. At each iteration of CoCoA+, several groups of dual variables are updated. We 
focus on CoCoA+SDCA, where each group is updated using some number of SDCA iterations. 
When each group consists of a single variable, this reduces exactly to mini-batch SDCA. Allowing 
for multiple SDCA iterations on larger groups of variables yields a method that is more computationally 
demanding than mini-batch SDCA, and intuitively should be better than SDCA (and does 
appear better in practice). However, we show that our mini-batch SDCA analysis strictly dominates 
the CoCoA+ analysis: that is, with the same number of total dual variables updated per iteration, 
and thus less computation, our mini-batch SDCA guarantees are strictly better than those obtained 
for CoCoA+. Mini-batch SDCA is thus a simpler, computationally cheaper method, with better 
guarantees than those established for CoCoA+. 

Although SDCA is a dual method, improving the dual at each iteration, following the analysis 
methodology of [23], all our guarantees are on the duality gap, and thus on the primal sub-optimality, 
that is on the actual regularized error we care about. 

2 Setup and Preliminaries 

We consider the problem of minimizing the regularized empirical loss 

$$\min_{w \in \mathbb{R}^d} \; P(w) := \frac{1}{n}\sum_{i=1}^n \phi_i(w^T x_i) + \frac{\lambda}{2}\|w\|^2, \tag{P}$$

where $x_1, \dots, x_n \in \mathbb{R}^d$ are given training examples, $\lambda > 0$ is a given regularization parameter 
and $\phi_i : \mathbb{R} \to \mathbb{R}$ are given non-negative convex loss functions that already incorporate the labels 
(e.g. $\phi_i(z) = \phi(y_i z)$ where $y_i \in \{\pm 1\}$ are given labels). Instead of solving (P), we solve the dual [23] 

$$\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) := -\frac{1}{n}\sum_{i=1}^n \phi_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}X^T\alpha\right\|^2, \tag{D}$$

where $\phi_i^* : \mathbb{R} \to \mathbb{R}$ is the convex conjugate of $\phi_i$, defined in the standard way as 
$\phi_i^*(u) = \max_z (zu - \phi_i(z))$, and $X = [x_1, \dots, x_n]^T \in \mathbb{R}^{n \times d}$ is the data matrix, where each row corresponds 
to one sample and each column corresponds to one feature. If $\alpha^*$ is a dual optimum of (D) then 
$w^* = \frac{1}{\lambda n}X^T\alpha^*$ is a primal optimum of (P). We therefore consider the mapping $w_\alpha = \frac{1}{\lambda n}X^T\alpha$ and 
define the duality gap of a feasible $\alpha \in \mathrm{dom}(D)$ as: 

$$G(\alpha) := P(w_\alpha) - D(\alpha). \tag{G}$$
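As an illustration of (P), (D) and (G), the following is a minimal sketch for the hinge loss $\phi_i(z) = \max(0, 1 - y_i z)$, whose conjugate satisfies $\phi_i^*(-\alpha_i) = -y_i\alpha_i$ when $y_i\alpha_i \in [0, 1]$ and is infinite otherwise. The function names and the use of NumPy are choices made for this example, not part of the paper.

```python
import numpy as np

def hinge_primal(w, X, y, lam):
    """P(w) = (1/n) sum_i max(0, 1 - y_i x_i^T w) + (lam/2) ||w||^2."""
    margins = 1.0 - y * (X @ w)
    return np.mean(np.maximum(0.0, margins)) + 0.5 * lam * (w @ w)

def hinge_dual(alpha, X, y, lam):
    """D(alpha) for the hinge loss; -inf when some y_i * alpha_i leaves [0, 1]."""
    n = X.shape[0]
    ya = y * alpha
    if np.any(ya < -1e-12) or np.any(ya > 1.0 + 1e-12):
        return -np.inf
    w = X.T @ alpha / (lam * n)
    # On the feasible set, -phi_i^*(-alpha_i) = y_i * alpha_i.
    return np.mean(ya) - 0.5 * lam * (w @ w)

def duality_gap(alpha, X, y, lam):
    """G(alpha) = P(w_alpha) - D(alpha) with w_alpha = X^T alpha / (lam * n)."""
    n = X.shape[0]
    w = X.T @ alpha / (lam * n)
    return hinge_primal(w, X, y, lam) - hinge_dual(alpha, X, y, lam)
```

At $\alpha = 0$ the primal value is 1 and the dual value is 0, so the gap starts at 1; by weak duality it is non-negative for every feasible $\alpha$.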

Stochastic Dual Coordinate Ascent (SDCA) SDCA is a coordinate ascent algorithm optimizing 
the dual (D). At the $t$-th iteration of SDCA, a coordinate $i \in [n] := \{1, 2, \dots, n\}$ is chosen at random 
and then a new iterate is obtained by updating only the $i$-th coordinate and keeping all other 
coordinates of $\alpha$ unchanged, i.e. $\alpha^{(t+1)} = \alpha^{(t)} + \Delta\alpha_i^{(t)} e_i$, where 

$$\Delta\alpha_i^{(t)} = \arg\max_{\delta \in \mathbb{R}} D(\alpha^{(t)} + \delta e_i). \tag{1}$$

Assumptions on Loss Function We analyze mini-batched SDCA under one of two different assumptions 
on the loss functions: that they are $L$-Lipschitz continuous (but potentially non-smooth), 
or that they are $(1/\gamma)$-smooth. Formally: i) $L$-Lipschitz continuous loss: $\forall i$, $\forall a, b \in \mathbb{R}$ we have 
$|\phi_i(a) - \phi_i(b)| \le L|a - b|$; ii) $(1/\gamma)$-smooth loss: each loss function $\phi_i$ is differentiable and its 
derivative is $(1/\gamma)$-Lipschitz continuous, i.e. $\forall a, b \in \mathbb{R}$ we have $|\phi_i'(a) - \phi_i'(b)| \le \frac{1}{\gamma}|a - b|$; iii) we 
also assume that $\phi_i$ are non-negative and that $\phi_i(0) \le 1$ for all $i$. 

For a positive vector $v = (v_1, \dots, v_n)^T > 0$ we define a weighted Euclidean norm $\|\alpha\|_v^2 = \sum_{i=1}^n v_i\alpha_i^2$. 
Instead of assuming the data is uniformly bounded, we will frequently refer to the 
weighted norm on $\mathbb{R}^n$ with weights proportional to the squared magnitudes, i.e. $v_i \sim \|x_i\|^2$. 

3 Mini-Batched SDCA 

At each iteration of mini-batched SDCA, a subset $S \subset [n]$ of the coordinates is chosen at random 
(see below for a discussion of the sampling distribution) and a new dual iterate is obtained by 
independently updating only the chosen coordinates. Since each coordinate is updated independently, 
mini-batch SDCA is amenable to parallelization. 

The naive approach is to use the same update rule for each coordinate as in the serial case: the update 
is then given by $\alpha^{(t+1)} = \alpha^{(t)} + \sum_{i \in S}\Delta\alpha_i^{(t)} e_i$, where $\Delta\alpha_i^{(t)}$ is given by (1). Such a naive approach 
could be fine if the mini-batch size is very small and the data is “spread-out” enough [2]. However, 
more generally, not only might such a mini-batch iteration not be better than an iteration based on 
only a single point, but such a naive mini-batch update might actually be much worse. In particular, 
it is easy to construct an example with just two training points where a naive mini-batch approach will 
never reach the optimum solution, and diverging behavior frequently occurs in practice on real data 
sets [27]. The problem here is that the independent updates on multiple similar points might combine 
together to “overshoot” the optimum and hurt the objective. 

An alternative that avoids this problem is to average the updates instead of adding them up, 
$\alpha^{(t+1)} = \alpha^{(t)} + \frac{1}{|S|}\sum_{i \in S}\Delta\alpha_i^{(t)} e_i$ [8, 29, 28], but such an update is overly conservative: it is not any better than 
just updating a single dual variable, and cannot lead to parallelization speedups. Following [27], 
the approach we consider here is to use a summed update $\alpha^{(t+1)} = \alpha^{(t)} + \sum_{i \in S}\Delta\alpha_i^{(t)} e_i$, where the 
independent updates $\Delta\alpha_i^{(t)}$ are derived from a relaxation of the dual: 

$$\Delta\alpha_i^{(t)} = \arg\max_{\delta \in \mathbb{R}} \; -\phi_i^*(-(\alpha_i^{(t)} + \delta)) - \frac{v_i}{2\lambda n}\delta^2 - w_{\alpha^{(t)}}^T x_i\,\delta. \tag{2}$$

When $v_i = \|x_i\|^2$, the update exactly agrees with the dual-optimizing update (1). But as we shall 
see, when larger mini-batches are used, larger values of $v_i$ are required, resulting in smaller steps. 
The update (2) generalizes [27], where a single parameter $v_i = v$ was used; here we allow $v_i$ to 
vary between dual variables, accommodating differences in $\|x_i\|$. 
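For the hinge loss, update (2) has a closed form, which makes the damping role of $v_i$ concrete. The sketch below assumes $\phi_i(z) = \max(0, 1 - y_i z)$ with $y_i \in \{\pm 1\}$, so that $-\phi_i^*(-(\alpha_i + \delta)) = y_i(\alpha_i + \delta)$ whenever $y_i(\alpha_i + \delta) \in [0, 1]$; the function name and argument order are illustrative only.

```python
import numpy as np

def hinge_update(alpha_i, x_i, y_i, w, v_i, lam, n):
    """Maximizer of y_i*(alpha_i + d) - v_i/(2*lam*n)*d**2 - (w @ x_i)*d
    subject to y_i*(alpha_i + d) in [0, 1] (hinge-loss instance of update (2))."""
    # Stationary point of the concave quadratic, ignoring the box constraint.
    d_free = lam * n * (y_i - w @ x_i) / v_i
    # A 1-D concave problem on an interval is solved by clipping to the interval.
    new_alpha = y_i * np.clip(y_i * (alpha_i + d_free), 0.0, 1.0)
    return new_alpha - alpha_i
```

A larger $v_i$ shrinks `d_free` proportionally, which is exactly the "smaller steps" effect described above; with $v_i = \|x_i\|^2$ this is the usual serial SDCA hinge step.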

To summarize, the mini-batch SDCA algorithm we consider takes as input data $X$, loss functions 
$\phi_i$, a distribution over subsets $S \subset [n]$, which we will refer to as the random sampling $S$, and a 
weight vector $v$, and proceeds as shown in Algorithm 1. 

We will refer to several sampling distributions $S$, yielding different variants of mini-batch SDCA: 

Serial SDCA. $S$ is a uniform distribution over singletons. That is, $S_t$ contains a single coordinate 
chosen uniformly at random. Setting $v_i = \|x_i\|^2$ yields standard SDCA. 

Standard Mini-batch SDCA. $S$ is a uniform distribution over subsets of size $b$. 

Distributed SDCA. Consider a setting with $C$ machines, $n$ total data points and a mini-batch size $b$, where for simplicity 
$n$ and $b$ are both integer multiples of $C$. For a partition of the $n$ coordinates into $C$ equal sized 
subsets $\{P_c\}_{c=1}^C$, consider the following sampling distribution $S$: for each $c = 1, \dots, C$, choose a 
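Algorithm 1 mSDCA: minibatch Stochastic Dual Coordinate Ascent 

1: Input: $X$, $y$, $S$, $v$ 
2: set $\alpha^{(0)} = 0 \in \mathbb{R}^n$ 
3: for $t = 0, 1, 2, \dots$ do 
4:   choose $S_t$ according to the distribution $S$ 
5:   set $\alpha^{(t+1)} = \alpha^{(t)}$; $w_{\alpha^{(t)}} = \frac{1}{\lambda n}X^T\alpha^{(t)}$ 
6:   for $i \in S_t$ in parallel do 
7:     $\Delta\alpha_i^{(t)} = \arg\max_{\delta} \; -\phi_i^*(-(\alpha_i^{(t)} + \delta)) - \frac{v_i}{2\lambda n}\delta^2 - w_{\alpha^{(t)}}^T x_i\,\delta$ 
8:     $\alpha_i^{(t+1)} = \alpha_i^{(t)} + \Delta\alpha_i^{(t)}$ 
9:   end for 
10: end for 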


subset $S_c \subset P_c$ uniformly and independently at random among all such subsets of size $b/C$, and 
then take their union. We refer to such a sample as a $(C, b)$-distributed sampling. Such a sampling is 
suitable in a distributed environment when $n$ samples are equally partitioned over $C$ computational 
nodes in a cluster [19, 16]. When $C = 1$ we obtain the Standard Mini-batch sampling. 
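The $(C, b)$-distributed sampling and the loop of Algorithm 1 can be sketched together as follows. This is a single-process illustration only: it assumes the hinge loss (so that step 7 has the clipped closed form), assumes the partition $P_c$ consists of contiguous index blocks, and takes the weight vector `v` as given by the caller. All function names are hypothetical.

```python
import numpy as np

def distributed_sample(n, C, b, rng):
    """Draw a (C, b)-distributed sample: b/C indices from each of C equal blocks."""
    assert n % C == 0 and b % C == 0
    block, per_machine = n // C, b // C
    parts = [c * block + rng.choice(block, size=per_machine, replace=False)
             for c in range(C)]
    return np.concatenate(parts)

def msdca_hinge(X, y, lam, v, C, b, iters, seed=0):
    """Mini-batch SDCA for the hinge loss with summed, v-damped updates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    alpha = np.zeros(n)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        S = distributed_sample(n, C, b, rng)
        # Every update in the batch uses the same w, so they are independent
        # of each other and could run in parallel.
        d_free = lam * n * (y[S] - X[S] @ w) / v[S]
        new = y[S] * np.clip(y[S] * (alpha[S] + d_free), 0.0, 1.0)
        delta = new - alpha[S]
        alpha[S] = new
        # Maintain w = X^T alpha / (lam * n) incrementally.
        w += X[S].T @ delta / (lam * n)
    return alpha, w
```

Setting `C=1` gives standard mini-batch sampling, and `C=1, b=1` with `v = (X**2).sum(axis=1)` gives serial SDCA.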

The main question we now need to address is what weights are suitable for use with each of the 
above sampling schemes, and what optimization guarantees they yield. To answer this question, in 
the next Section we will introduce the notion of Expected Separable Overapproximations. 


4 Expected Separable Overapproximation 

In this Section we will make use of the Expected Separable Overapproximation (ESO) theory 
introduced in [20] and further extended e.g. in [16, 19, 18]. 

4.1 Motivation 

Consider the $t$-th iteration of mini-batch SDCA. Our current iterate is $\alpha^{(t)}$ and we have chosen a 
set $S_t$ of coordinates which we will update in the current iteration. We need to compute the updates 
to those coordinates, i.e. $\forall i \in S_t$ we need to compute $\Delta\alpha_i^{(t)}$. Perhaps the most natural way to 
define the updates is to make $D(\alpha^{(t+1)})$ as large as possible, i.e. to 
maximize $D(\alpha^{(t)} + \sum_{i \in S_t}\Delta\alpha_i^{(t)} e_i)$. However, e.g. for the hinge loss this would lead to a QP, hence 
the computational cost would be substantial. The main disadvantage of this approach is the fact that 
the updates for different coordinates depend on each other, i.e. the value of $\Delta\alpha_i^{(t)}$ depends 
on all coordinates in $S_t$. This makes it hard to parallelize. Considering the fact that $S_t$ is a random 
set, one might instead define the updates so that they do not depend on the current choice 
of $S_t$ and so that they maximize the expected value of $D$ at the next iteration. In this case we face the 
following maximization problem 

$$\max_{t \in \mathbb{R}^n} E[D(\alpha^{(t)} + t_{[S_t]})], \tag{3}$$

where $t_{[S_t]}$ is a masking operator setting all coordinates of $t$ which are not in the set $S_t$ to zero, i.e. 
$(t_{[S_t]})_i = t_i$ if $i \in S_t$ and $(t_{[S_t]})_i = 0$ otherwise. The expectation in (3) is taken over the 
distribution $S$. After we get the optimal solution $t^*$ of (3) we can define $\Delta\alpha_i^{(t)} = t_i^*$ for all $i \in S_t$. 
Therefore $\alpha^{(t+1)} = \alpha^{(t)} + \sum_{i \in S_t}\Delta\alpha_i^{(t)} e_i = \alpha^{(t)} + t^*_{[S_t]}$. However, the problem (3) is now even more 
complicated. The remedy is to replace $E[D(\alpha^{(t)} + t_{[S_t]})]$ by a separable lower bound. Because 
the bound is separable, the update for any coordinate $i$ is independent of the other 
coordinates in $S_t$ and, moreover, the updates are obtained by solving a 1D problem. 

4.2 Lower-bound 

Let us first state the definition of ESO. 
Definition 1 (Expected Separable Overapproximation [20]). Assume that sampling $S$ has uniform 
marginals. Then we say that a function $f$ admits a $v$-ESO with respect to the sampling $S$ if $\forall x, t \in \mathbb{R}^n$ 
we have 


E[/(« + W] < /(«) + *((V/(«), t) + ±||t|| 2 ). 


(4) 


Let us now assume that we can find a vector $v$ such that (4) holds (we show how to find $v$ 
in Section 4.3), and show how to derive the lower bound of $E[D(\alpha^{(t)} + t_{[S_t]})]$. If we write 
(4) for a particular choice of $f$, namely for $f(\alpha) = \|\frac{1}{\lambda n}X^T\alpha\|^2$, we obtain 

$$E\Big[\Big\|\frac{1}{\lambda n}X^T(\alpha + t_{[S]})\Big\|^2\Big] \le \|w_\alpha\|^2 + \frac{E[|S|]}{n}\Big(\frac{1}{\lambda^2 n^2}\|t\|_v^2 + \frac{2}{\lambda n}t^T X w_\alpha\Big). \tag{5}$$


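Now we can derive the expected lower bound of $D$ as follows 

$$\begin{aligned}
E[D(\alpha + t_{[S]})] &\overset{(D)}{=} E\Big[-\frac{1}{n}\sum_{i=1}^n \phi_i^*(-(\alpha + t_{[S]})_i)\Big] - \frac{\lambda}{2}E\Big[\Big\|\frac{1}{\lambda n}X^T(\alpha + t_{[S]})\Big\|^2\Big] \\
&\overset{(5)}{\ge} -\Big(1 - \frac{E[|S|]}{n}\Big)\frac{1}{n}\sum_{i=1}^n \phi_i^*(-\alpha_i) - \frac{E[|S|]}{n}\,\frac{1}{n}\sum_{i=1}^n \phi_i^*(-(\alpha_i + t_i)) \\
&\qquad - \frac{\lambda}{2}\|w_\alpha\|^2 - \frac{\lambda}{2}\,\frac{E[|S|]}{n}\Big(\frac{1}{\lambda^2 n^2}\|t\|_v^2 + \frac{2}{\lambda n}t^T X w_\alpha\Big),
\end{aligned} \tag{6}$$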


where in the first inequality for the first part we have used the fact that the function is separable (see 
Theorem 4 in [20]). If we define 

$$\mathcal{H}(t, \alpha) := -\frac{1}{n}\sum_{i=1}^n \phi_i^*(-(\alpha_i + t_i)) - \frac{\lambda}{2}\|w_\alpha\|^2 - \frac{1}{2\lambda n^2}\|t\|_v^2 - \frac{1}{n}t^T X w_\alpha, \tag{7}$$

then it is easy to see that we have found a separable (in $t$) expected lower approximation of $D$, i.e. it 
holds $\forall \alpha, t \in \mathbb{R}^n$ that $E[D(\alpha + t_{[S]})] \ge \frac{b}{n}\mathcal{H}(t, \alpha) + (1 - \frac{b}{n})D(\alpha)$, where $b := E[|S|]$ is the average 
mini-batch size. Now let us note again that it is very hard to maximize $E[D(\alpha + t_{[S]})]$ in 
$t$, but maximizing $\mathcal{H}(t, \alpha)$ in $t$ is very simple, because this function is separable in $t$. It 
is also easy to verify that the steps in Algorithm 1 are maximizing $\mathcal{H}$. 

4.3 Computing ESO Parameter 


In the previous Section we have shown that using ESO we can find a separable lower bound of 
$E[D(\alpha + t_{[S]})]$. However, we have not explained how the ESO parameter (vector $v$) can be obtained. 


In this Section we present some of the results obtained in literature |2()1 ~6 (161131 for formulas for 
computing vector v for samplings described in Section [3 Let us mention that all formulas are data 


dependent. Some of them involve the spectral radius of the matrix $D^{-1/2}XX^TD^{-1/2}$, where 
$D = \mathrm{diag}(XX^T)$, which we will denote by $\sigma^2$, hence $\sigma^2 := \max_{\alpha \in \mathbb{R}^n, \|\alpha\| = 1} \frac{1}{n}\|X^T D^{-1/2}\alpha\|^2$. 
Note that this can be impossible to compute in practice (we can estimate it using e.g. the power method) 

or we can use an upper-bound (derived in Lemma 5.4 0 ) by u> = max — : 

*6(n) n 




\\xi\\o(Xi ej) 


i ) 2 


where by $\|x_i\|_0$ we have denoted the number of non-zero elements of the $i$-th data point. 


Serial SDCA. In this simplest case we can define $v_i = \|x_i\|^2$. 

Standard Mini-batch SDCA. In standard mini-batch we can choose i>j = (1+ )ll a 't|| 2 - 

If the data matrix X is sparse, we can define = J2j=i( x f e i) 2 (^ + ^°~ 1 )■ 

Distributed SDCA. In distributed case we can choose v t = (1 + ’ey ) 11 x i 11 2 »provided 

that $b \ge 2C$, and $v_i = (1 + b\sigma^2)\|x_i\|^2$ if $b = C$. A simple upper bound valid for any $b$ can be derived 
as follows: $v_i = 2(1 + b\sigma^2)\|x_i\|^2$. 
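The quantity $\sigma^2$ and the simplest weight choices can be sketched numerically. The code below assumes that $\sigma^2$ equals $\frac{1}{n}$ times the largest eigenvalue of the row-normalized Gram matrix $D^{-1/2}XX^TD^{-1/2}$ (so that $1/n \le \sigma^2 \le 1$), estimates it by power iteration, and implements only the three weight formulas that are stated explicitly in this section. The function names and the `scheme` strings are hypothetical, and rows of `X` are assumed to be non-zero.

```python
import numpy as np

def sigma_sq(X, iters=200, seed=0):
    """Power-iteration estimate of (1/n) * lambda_max(D^{-1/2} X X^T D^{-1/2})."""
    n = X.shape[0]
    Z = X / np.linalg.norm(X, axis=1, keepdims=True)  # rows scaled to unit norm
    u = np.random.default_rng(seed).normal(size=X.shape[1])
    u /= np.linalg.norm(u)
    for _ in range(iters):
        u = Z.T @ (Z @ u)  # Z^T Z has the same non-zero spectrum as Z Z^T
        u /= np.linalg.norm(u)
    return float(np.linalg.norm(Z @ u) ** 2 / n)

def eso_weights(X, b, s2, scheme):
    """Weights v_i for the cases written out explicitly in the text."""
    sq = np.sum(X ** 2, axis=1)
    if scheme == "serial":
        return sq
    if scheme == "distributed_b_equals_C":
        return (1.0 + b * s2) * sq
    if scheme == "distributed_upper_bound":
        return 2.0 * (1.0 + b * s2) * sq
    raise ValueError(f"unknown scheme: {scheme}")
```

Two extreme cases are useful sanity checks: identical rows give $\sigma^2 = 1$ (no benefit from mini-batching), while mutually orthogonal rows give $\sigma^2 = 1/n$.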


5 Convergence Guarantees 

We are now ready to present optimization guarantees for Algorithm 1 based on the ESO parameters 
studied in the previous Section. These theorems extend the serial case of [23] to the mini-batch setting. 
The theorems assume that the weights $v$ are chosen such that $f(\alpha) = \|\frac{1}{\lambda n}X^T\alpha\|^2$ admits a $v$-ESO for 
the sampling $S$ used in Algorithm 1. Proofs are provided in the supplemental material. 
Theorem 2 ($(1/\gamma)$-Smooth Loss). If the losses are $(1/\gamma)$-smooth and $f(\alpha) = \|\frac{1}{\lambda n}X^T\alpha\|^2$ admits a 
$v$-ESO for the sampling $S$, then for a desired duality gap $\epsilon_G > 0$, using Algorithm 1, if we choose 


T > 



n 



)iog( t f 




( 8 ) 


we have that E[P(wt) — 2?(q!t)] < ep. To obtain E[P(w) — 25(a)] < eg , it is sufficient to choose 
T 0 > { T-T 0 )e g )’ where 


a = 


1 v^T-i 
T-To l^t=To+l 


a®. 


(9) 


Moreover, iff > IHI ” A + Aw7 log( IHI ” A * An7 -±- p ) then P {V(w f ) - V(a f ) < eg) > 1 - p. 

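Theorem 3 ($L$-Lipschitz Continuous Loss). If the losses are $L$-Lipschitz and $f(\alpha) = \|\frac{1}{\lambda n}X^T\alpha\|^2$ 
admits a $v$-ESO for the sampling $S$, then for a desired duality gap $\epsilon_G > 0$, using Algorithm 1, denoting 
$G = \frac{4L^2\sum_{i=1}^n v_i}{n}$, if we choose 

$$T_0 \ge t_0 + \frac{1}{b}\Big(\frac{4G}{\lambda\epsilon_G} - 2n\Big)_+, \qquad T \ge T_0 + \max\Big\{\Big\lceil\frac{n}{b}\Big\rceil, \frac{1}{b}\,\frac{G}{\lambda\epsilon_G}\Big\}, \tag{10}$$

$$t_0 \ge \max\Big(0, \Big\lceil\frac{n}{b}\log\big(2\lambda n\epsilon_D^{(0)}/G\big)\Big\rceil\Big), \tag{11}$$

we have that $E[P(w_{\bar\alpha}) - D(\bar\alpha)] \le \epsilon_G$, where $\bar\alpha$ is defined in (9). Moreover, when $t \ge T_0$, we have the dual 
sub-optimality bound $E[D(\alpha^*) - D(\alpha^{(t)})] \le \frac{1}{2}\epsilon_G$. 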


6 Guarantees and Speedups for Specific Sampling Distributions 

Theorems 3 and 2 are stated in terms of the ESO parameter $v$. Let us now consider the specific sampling 
distributions of interest. Assume for simplicity $\|x_i\| \le 1$, and define 

4*1 = 1 4td = 1 + mlK-)? %St = (12) 

for the serial, standard and distributed sampling schemes respectively, with overall mini-batch size 
$b$ and distribution over $C$ machines. Using the weights $v_i = \beta$, we then obtain 
the following iteration complexities: 

(I/ 7 ) -Smooth Loss. In this case © in Theorem [2] becomes T > + j) log(^(-^ + 

and hence the iteration complexity is (ignoring logarithmic terms): O fj + ■ 

$L$-Lipschitz Continuous Loss. Combining equations (10) and (11), and again ignoring logarithmic 
factors, we get an iteration complexity of: 

O (? + !£)• (13) 

Plugging $\beta_{\mathrm{std}}$ into (13) recovers the previous analysis of Lipschitz loss with standard sampling. 

Both the Lipschitz and smooth cases involve two terms: the first term, $n/b$, always displays a linear 
improvement as we increase the mini-batch size. However, in the second term, we also have a 
dependence on the data-dependent quantity $1 \le \beta \le b$, which depends on the mini-batch size $b$. We will 
have a linear improvement in the second term, i.e. potential for linear speedup, as long as $\beta = O(1)$. 
For standard sampling we have that $\beta_{\mathrm{std}} \approx 1 + b\sigma^2$, and so we obtain linear speedups as long as 
$b = O(1/\sigma^2)$, as discussed in (13). We can now also quantify the effect of distributed sampling and 
see that it is quite negligible and yields almost the same speedups and the same maximum allowed 
mini-batch size as with standard sampling. Note that typically we will have $C \ll b$, as we would like 
to process multiple examples on each machine; otherwise communication costs would overwhelm 
computational costs [26]. The analysis supports this choice as well as the extreme choice $C = b$. 

Focusing on the smooth loss, it is possible to obtain a linear reduction in the iteration complexity 
(corresponding to linear speedups) for SGD with mini-batch size of up to $O(\sqrt{n})$ without any 
data-dependent assumption, that is regardless of the value of $\beta$ HOD El. Is this possible also with SDCA? 
Indeed, even if we don't account for the data dependent quantity $\beta$, since we always have $\beta \le b$, 
the iteration complexity of mini-batch SDCA with smooth loss is $O((1/(\lambda\gamma) + n/b)\log(1/\epsilon))$: 
a larger mini-batch scales the second term (unconditional on any data dependence), 
and as long as it is the dominant term, we get linear speedups. Now, to get the min-max learning 
guarantee, we need to set $\lambda = O(1/\sqrt{n})$ (see [24]). Plugging this in, we see that we get linear 
speedups up to a mini-batch of size $O(\sqrt{n})$. Unsurprisingly, this is the same as the mini-batch SGD 
guarantee. Now, if we do take data-dependence into account, we have $\beta = O(1 + b\sigma^2)$ (where $\sigma^2$ is 
as defined above). As long as $b \le 1/\sigma^2$, we get linear speedups even if the $1/\lambda$ term is dominant, i.e. 
regardless of the scaling of $\lambda$ relative to $n$. This is good, because in practice, and especially 
when the expected error is low, the best $\lambda$ is often closer to $1/n$ and not $1/\sqrt{n}$. Returning to the 
worst-case rate and $\lambda = 1/\sqrt{n}$: we now have an allowed mini-batch size of up to $b = O(\sqrt{n}/\sigma^2)$ 
while still getting linear scaling. That is, we can combine the benefits of both smoothness, where 
we can scale the mini-batch size by $\sqrt{n}$, and the data dependence, to get an additional scaling by 
$1/\sigma^2$. 
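The speedup discussion can be made concrete with a small calculation. The sketch below assumes, ignoring constants and logarithmic factors, that the smooth-loss iteration count behaves like $n/b + \beta/(b\lambda\gamma)$ with $\beta = 1 + b\sigma^2$; this model is an assumption of the example, chosen to be consistent with the $O((1/(\lambda\gamma) + n/b)\log(1/\epsilon))$ bound quoted above when $\beta \le b$. The function names are illustrative.

```python
def iteration_model(n, b, lam, gamma, sigma2):
    """Assumed model n/b + beta/(b*lam*gamma), with beta = 1 + b*sigma2."""
    beta = 1.0 + b * sigma2
    return n / b + beta / (b * lam * gamma)

def speedup(n, b, lam, gamma, sigma2):
    """Ratio of modelled iteration counts: serial (b = 1) versus mini-batch b."""
    return iteration_model(n, 1, lam, gamma, sigma2) / iteration_model(n, b, lam, gamma, sigma2)
```

With $n = 10^6$, $\lambda = 1/\sqrt{n}$ and $\gamma = 1$, well-spread data ($\sigma^2 = 1/n$) gives a speedup close to $b$ for $b = 100$, whereas for $\sigma^2 = 1$ the $\sigma^2/(\lambda\gamma)$ part of the model does not shrink with $b$ and the speedup saturates below $1/\lambda$-scale values however large $b$ becomes.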


7 Comparison with C 0 C 0 A+ 

CoCoA+ [14] is a recently presented framework and analysis for distributed optimization of the 
dual (D): data (and hence dual variables) are partitioned among $C$ machines (as in our distributed 
sampling), defining $C$ subproblems, one for each machine. At each iteration, the set of dual variables 
of each of the $C$ machines are updated independently, and then communicated and aggregated across 
machines. Different local updates can be used, and the CoCoA+ analysis depends on how well 
the update improves the local subproblem. Here we will consider using local SDCA updates in 
conjunction with CoCoA+: at each iteration, on each of the $C$ machines, $b/C$ dual variables are 
selected (as in our distributed sampling), and $H$ iterations of SDCA are performed sequentially 
on these $b/C$ points (in parallel on each of the $C$ machines, and while considering all other dual 
variables, including all variables on other machines, as fixed). 

We will consider for simplicity 1-smooth loss functions and compare the CoCoA+ guarantees on 
the number of required iterations [14] to the SDCA guarantees we present here, noting also the 
differences in the amount of computation per iteration. In all our comparisons, the required 
communication in each iteration of SDCA and CoCoA+ is identical and amounts to a single distributed 
averaging of vectors in $\mathbb{R}^d$. 

Setting $b = C$ and $H = 1$, we exactly recover mini-batch SDCA with a mini-batch of size $b$, 
and so we would expect the CoCoA+ analysis to yield the same guarantee. However, our guarantee 
on the number of required iterations in this case is (ignoring log factors) O 

compared to the C 0 C 0 A+ guarantee (ignoring log factors): O ^ + j + where a 2 = 

max c max Q .^ \\onxi\\ 2 =i (^HX^e-p a i x i II) > a 2 > 1 jn. Our guarantee therefore dominates 
that of C 0 C 0 A+: the second term is worse by a factor of no 2 > 1, the third by a factor of 1/a 2 < 1 
and the fourth term in the CoCoA+ bound can be particularly bad when $\lambda$ is small (e.g. when 
$\lambda \propto 1/n$). 

Setting $b > C$ and $H = b/C$, both mini-batch SDCA and CoCoA+ perform the same number 
of SDCA updates (same amount of computation) at each iteration, but while mini-batch SDCA’s 
updates are entirely independent, each group of $H$ CoCoA+ updates (the $H$ updates on the same 
machine) are performed sequentially. We would therefore expect CoCoA+’s updates to be better, 
and therefore require fewer iterations. Unfortunately, the CoCoA+ analysis does not show this. 

To see the deficiency in the CoCoA+ analysis at another extreme, consider the case where $b = n$, 
$1 < C < n$ and $H \to \infty$. In this case, each iteration of mini-batch SDCA is actually a 
full batch of parallel updates (updating each coordinate independently), while for CoCoA+ this 
corresponds to fully optimizing each group of $n/C$ dual variables using many SDCA updates (and 

thus much more computation). Still, the C 0 C 0 A+ iteration bound here would be O ^1 + —j-J, 
where a' = max Q j? || jry a H anc * so a '® 2 — fj2 ’ compared to the better mini-batch SDCA 
bound O (l + <£)• 
Table 1: Basic characteristics of datasets; obtained from libsvm collection [12]. 

name       # train. samples   # test samples   # features   Sparsity
epsilon    400,000            -                2,000        100%
rcv1       20,242             677,399          47,236       0.15%
news20     15,000             4,996            1,355,191    0.03%
real-sim   72,309             -                20,958       0.24%

[Figure 1 panels: epsilon, news20, rcvtest, real-sim] 

Figure 1: TOP ROW: Number of iterations needed to get an approximate solution is almost the same for 
standard SDCA and distributed SDCA for $C \in \{1, 2, 4, 8, 16\}$. BOTTOM ROW: Comparison of mSDCA and 
CoCoA+ when solving the SVM dual problem on $C = 4$ computers (left) and $C = 16$ computers (right). 


And so, even though CoCoA+ with SDCA updates should be a more powerful algorithm, its analysis 
[14] fails to show benefits over the simpler mini-batch SDCA, and our analysis here of mini-batch 
SDCA even dominates the CoCoA+ analysis. The reason for this is that CoCoA+ aims to be a more 
general framework capable of including arbitrary local solvers. Hence, necessarily, the analysis 
must be more conservative. 

8 Numerical Experiments 

In this Section we show that the cost of distribution is negligible (in terms of the number of iterations) when 
compared to standard mSDCA. We also show that if $b \gg 1$, then CoCoA+ is faster than mSDCA in 
practice. We have run experiments on 4 datasets (see Table 1). Note that most of the datasets are 
sparse (e.g. news20: an average sample depends on 385 features out of 1.3M). 

Standard vs. Distributed SDCA. Figure 1 (top row) compares standard and distributed SDCA. 
Recall that distributed sampling with $C = 1$ and standard mini-batch sampling coincide. On the 
x-axis is the parameter $b$ and on the y-axis we plot how many more data accesses we need, as $b$ or $C$ 
grows, to achieve the same accuracy. We see that the lines are almost identical for various choices 
of $C$, which implies that using distributed mSDCA does not affect the number of iterations 
significantly. This is also supported by the theory (notice that in (12) we have $\beta_{\mathrm{dist}}/\beta_{\mathrm{std}} \approx 1$). 
Also note that, for news20 for instance, increasing $b$ to $10^4$ implies that the number of data accesses 
(epochs) will increase by a factor of 11, which implies that the number of iterations will decrease by a factor of almost 1,000 
for $b = 10^4$ when compared with $b = 1$. 

mSDCA vs. CoCoA+. In Figure 1 (bottom row) we compare mSDCA with CoCoA+ with 
SDCA as a local solver. We plot the duality gap as a function of epochs (if communication is 
negligible then the main cost is in computation) or iterations (if the communication cost is significant 
then this is the correct measure of performance). As the results suggest, if the communication cost is 
negligible then mSDCA with small $b$ is the best (as expected); however, if the communication cost 
is significant, then CoCoA+ with large values of $H$ significantly outperforms mSDCA. 
A Technical Results 


Lemma 4 (Lemma 2 in [23]). For all $\alpha \in \mathbb{R}^n$: 

$$D(\alpha) \le P(w^*) \le P(0) \le 1. \tag{14}$$

Moreover $D(0) \ge 0$. 


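The following lemma is a mini-batch extension of Lemma 1 in [23]. 

Lemma 5 (Expected increase of dual objective). Assume that $\phi_i^*$ is $\gamma$-strongly convex ($\gamma$ can also be 
zero). Then, for any $t$ and any $s \in [0, 1]$ we have 

$$E[D(\alpha^{(t+1)}) - D(\alpha^{(t)})] \ge b\left(\frac{s}{n}G(\alpha^{(t)}) - \Big(\frac{s}{n}\Big)^2\frac{G^{(t)}}{2\lambda}\right), \tag{15}$$

where 

$$G^{(t)} = \frac{1}{n}\left(\|u^{(t)} - \alpha^{(t)}\|_v^2 - \frac{\gamma\lambda n(1 - s)}{s}\|u^{(t)} - \alpha^{(t)}\|^2\right), \tag{16}$$

$u^{(t)} = (u_1^{(t)}, \dots, u_n^{(t)})^T$ and $-u_i^{(t)} \in \partial\phi_i(w_{\alpha^{(t)}}^T x_i)$. 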

Lemma 6 (Lemma 3 in [23]). Let $\phi : \mathbb{R} \to \mathbb{R}$ be an $L$-Lipschitz continuous function. Then for any $|\alpha| > L$ 
we have that $\phi^*(\alpha) = \infty$. 


The following lemma is a small extension of Lemma 4 in [23] to obtain tighter bounds in case each 
sample has a different norm or when an ESO bound is used. For example, in the serial case we will have that 

Vt : G* < 

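Lemma 7 (Bound on $G^{(t)}$). Suppose that for all $i$, $\phi_i$ is $L$-Lipschitz continuous. Then 

$$\forall t: \quad G^{(t)} \le \frac{4L^2}{n}\sum_{i=1}^n v_i. \tag{17}$$

Proof. Indeed, 

$$G^{(t)} \overset{(16)}{=} \frac{1}{n}\sum_{i=1}^n\Big(v_i - \frac{\gamma\lambda n(1 - s)}{s}\Big)\big(u_i^{(t)} - \alpha_i^{(t)}\big)^2 \le \frac{1}{n}\sum_{i=1}^n v_i\big(u_i^{(t)} - \alpha_i^{(t)}\big)^2 \overset{\text{(Lemma 6)}}{\le} \frac{4L^2}{n}\sum_{i=1}^n v_i.$$

□ 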


Lemma 8 (Theorem 1 in [21]). Fix $x_0 \in \mathbb{R}^N$ and let $\{x_k\}_{k \ge 0}$ be a sequence of random vectors 
in $\mathbb{R}^N$ with $x_{k+1}$ depending on $x_k$ only. Let $\phi : \mathbb{R}^N \to \mathbb{R}$ be a nonnegative function and define 
$\xi_k = \phi(x_k)$. Lastly, choose accuracy level $0 < \epsilon < \xi_0$, confidence level $0 < \rho < 1$, and assume that 
the sequence of random variables $\{\xi_k\}_{k \ge 0}$ is nonincreasing and has one of the following properties: 

(i) $E[\xi_{k+1} \mid x_k] \le (1 - \frac{\xi_k}{c_1})\xi_k$ for all $k$, where $c_1 > \epsilon$ is a constant, 

(ii) $E[\xi_{k+1} \mid x_k] \le (1 - \frac{1}{c_2})\xi_k$ for all $k$ such that $\xi_k \ge \epsilon$, where $c_2 > 1$ is a constant. 

If property (i) holds and we choose $K \ge 2 + \frac{c_1}{\epsilon}(1 - \frac{\epsilon}{\xi_0} + \log(\frac{1}{\rho}))$, or if property (ii) holds and we 
choose $K \ge c_2\log(\frac{\xi_0}{\epsilon\rho})$, then $P(\xi_K \le \epsilon) \ge 1 - \rho$. 


B Proofs 


B.1 Proof of Lemma 5 

Let us define $T_\alpha$ as the unique maximizer of the function $\mathcal{H}(t, \alpha)$ defined in (7), i.e. 

$$T_\alpha := \arg\max_t \mathcal{H}(t, \alpha). \tag{18}$$

Let us now state some basic properties of the function $\mathcal{H}$. We have that $\forall t, \alpha \in \mathbb{R}^n$ and sampling $S$: 
• $\mathcal{H}(0, \alpha) = D(\alpha)$, 
• from ESO we have 

$$E[D(\alpha + t_{[S]})] \ge \Big(1 - \frac{b}{n}\Big)D(\alpha) + \frac{b}{n}\mathcal{H}(t, \alpha), \tag{19}$$

• $\mathcal{H}(t, \alpha) \le \mathcal{H}(T_\alpha, \alpha)$. 

Convex conjugate maximal property implies that 

$$\phi_i^*(-u_i^{(t)}) = -u_i^{(t)} w_{\alpha^{(t)}}^T x_i - \phi_i(w_{\alpha^{(t)}}^T x_i). \tag{20}$$

Let us estimate the expected change of the dual objective. 


E[X>(a (t) ) -P(a (t+1) )] = T E [V(a W )-V(a^ + (T aW ) [g] )] < V(a W ) - U(T aW , a (4) ) 


GU 


i n 

= -£ («(-(«< + ( T o«) (i) )) - 


i=l 


+ 


An 


*<«) 


+ 2 


An 


i(‘) 


Xw 0 


< - 55 (^i ( _ K W + s ( u * - a f } )) “ 

2=1 

+ ^ (||^ s ( u -“ (t) )|L + 2 (3k«( u - Q!(t) )) • 

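Using $\gamma$-strong convexity of $\phi_i^*$ we have that 

$$\phi_i^*\big(-(\alpha_i^{(t)} + s(u_i^{(t)} - \alpha_i^{(t)}))\big) \le s\phi_i^*(-u_i^{(t)}) + (1 - s)\phi_i^*(-\alpha_i^{(t)}) - \frac{\gamma}{2}(1 - s)s\big(u_i^{(t)} - \alpha_i^{(t)}\big)^2. \tag{21}$$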


Therefore, 


^E[T(a w )-D(a (t+1) )] < - 55 (s<t>i(~ u i) + su ixf w aM - s^*(-af } ) - 2(1 - s)s(«i - af } ) 2 )) 

2=1 

+ ^ (||^(— W )|L+ 2 (^(-^)) ^aw) 

| 2 qJ n 

< - 55 (~ u i W aW X i - M w lw x i) + Uixfw a (t) - 0* (“O^))) 

2=1 

+ ^ (-^( 1 -s)s||w-a|| 2 + ^s(u-a (t) ) ^ + 2 (^s(-a (t) )) Xw aW 

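Substituting the definition of the duality gap we obtain

$$\begin{aligned}
\frac{n}{b}\mathbf{E}\big[\mathcal{D}(\alpha^{(t)}) - \mathcal{D}(\alpha^{(t+1)})\big]
&\le \frac{s}{n}\sum_{i=1}^n \Big(-\phi_i\big(w(\alpha^{(t)})^\top x_i\big) - \phi_i^*(-\alpha_i^{(t)}) - \alpha_i^{(t)}\, x_i^\top w(\alpha^{(t)})\Big) \\
&\quad + \frac{\lambda}{2}\Big(-\frac{\gamma}{\lambda n}(1 - s)s\,\|u - \alpha^{(t)}\|^2 + \Big\|\frac{1}{\lambda n} s(u - \alpha^{(t)})\Big\|_v^2\Big) \\
&= -s\,\mathcal{G}(\alpha^{(t)}) + \frac{\lambda}{2}\Big(\Big\|\frac{1}{\lambda n} s(u - \alpha^{(t)})\Big\|_v^2 - \frac{\gamma}{\lambda n}(1 - s)s\,\|u - \alpha^{(t)}\|^2\Big) \\
&= -s\,\mathcal{G}(\alpha^{(t)}) + \frac{1}{2\lambda}\Big(\frac{s}{n}\Big)^2\Big(\|u - \alpha^{(t)}\|_v^2 - \frac{\gamma n \lambda (1 - s)}{s}\|u - \alpha^{(t)}\|^2\Big).
\end{aligned}$$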


Multiplying both sides by $-\frac{b}{n}$ we obtain (15).
B.2 Proof of Theorem [3|

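First, let us estimate the expected change of the dual suboptimality $\epsilon_{\mathcal{D}}^{(t)} = \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)})$.

$$\begin{aligned}
\mathbf{E}[\epsilon_{\mathcal{D}}^{(t+1)}]
&= \mathbf{E}[\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t+1)})]
= \mathbf{E}[\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t+1)}) + \mathcal{D}(\alpha^{(t)}) - \mathcal{D}(\alpha^{(t)})] \\
&= \mathbf{E}[\mathcal{D}(\alpha^{(t)}) - \mathcal{D}(\alpha^{(t+1)}) + \epsilon_{\mathcal{D}}^{(t)}]
\overset{(15)}{\le} -b\Big(\frac{s}{n}\mathbf{E}[\mathcal{G}(\alpha^{(t)})] - \Big(\frac{s}{n}\Big)^2\frac{G^{(t)}}{2\lambda}\Big) + \mathbf{E}[\epsilon_{\mathcal{D}}^{(t)}] \\
&\le -b\frac{s}{n}\mathbf{E}[\epsilon_{\mathcal{D}}^{(t)}] + b\Big(\frac{s}{n}\Big)^2\frac{G}{2\lambda} + \mathbf{E}[\epsilon_{\mathcal{D}}^{(t)}]
= \Big(1 - b\frac{s}{n}\Big)\mathbf{E}[\epsilon_{\mathcal{D}}^{(t)}] + b\Big(\frac{s}{n}\Big)^2\frac{G}{2\lambda}.
\end{aligned} \tag{22}$$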


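From the above it follows that

$$\mathbf{E}[\epsilon_{\mathcal{D}}^{(t)}] \le \Big(1 - b\frac{s}{n}\Big)^t \epsilon_{\mathcal{D}}^{(0)} + b\Big(\frac{s}{n}\Big)^2\frac{G}{2\lambda}\sum_{i=0}^{t-1}\Big(1 - b\frac{s}{n}\Big)^i \le \Big(1 - b\frac{s}{n}\Big)^t \epsilon_{\mathcal{D}}^{(0)} + \frac{s}{n}\frac{G}{2\lambda}. \tag{23}$$

The choice $s = 1$ and $t = t_0 := \max\{0, \lceil \frac{n}{b}\log(2\lambda n \epsilon_{\mathcal{D}}^{(0)}/G) \rceil\}$ leads to

$$\mathbf{E}[\epsilon_{\mathcal{D}}^{(t_0)}] \le \Big(1 - \frac{b}{n}\Big)^{t_0}\epsilon_{\mathcal{D}}^{(0)} + \frac{1}{n}\frac{G}{2\lambda} \le \frac{G}{2\lambda n \epsilon_{\mathcal{D}}^{(0)}}\epsilon_{\mathcal{D}}^{(0)} + \frac{G}{2\lambda n} = \frac{G}{\lambda n}. \tag{24}$$

Following the proof in [23] we are now going to show that

$$\forall t \ge t_0 : \quad \mathbf{E}[\epsilon_{\mathcal{D}}^{(t)}] \le \frac{2G}{\lambda(2n + b(t - t_0))}. \tag{25}$$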

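Clearly, (24) implies that (25) holds for $t = t_0$. Now assume that it holds for some $t \ge t_0$; we show that it then also holds for $t + 1$. Indeed, using $s = \frac{2n}{2n + b(t - t_0)}$ we obtain

$$\begin{aligned}
\mathbf{E}[\epsilon_{\mathcal{D}}^{(t+1)}]
&\overset{(22)}{\le} \Big(1 - b\frac{s}{n}\Big)\mathbf{E}[\epsilon_{\mathcal{D}}^{(t)}] + b\Big(\frac{s}{n}\Big)^2\frac{G}{2\lambda} \\
&\overset{(25)}{\le} \Big(1 - \frac{2b}{2n + b(t - t_0)}\Big)\frac{2G}{\lambda(2n + b(t - t_0))} + b\Big(\frac{2}{2n + b(t - t_0)}\Big)^2\frac{G}{2\lambda} \\
&= \frac{2G}{\lambda}\,\frac{1}{2n + b(t - t_0)}\Big(1 - \frac{b}{2n + b(t - t_0)}\Big)
= \frac{2G}{\lambda}\,\frac{2n + b(t - t_0) - b}{(2n + b(t - t_0))^2} \\
&= \frac{2G}{\lambda(2n + b(t - t_0) + b)}\cdot\frac{(2n + b(t - t_0) + b)(2n + b(t - t_0) - b)}{(2n + b(t - t_0))^2} \\
&\le \frac{2G}{\lambda(2n + b(t - t_0) + b)}.
\end{aligned}$$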

In the last inequality we have used the fact that the geometric mean is less than or equal to the arithmetic mean.
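The induction can also be checked numerically. The sketch below iterates the recursion $\epsilon_{t+1} = (1 - b s/n)\,\epsilon_t + b (s/n)^2 G/(2\lambda)$ with equality (the worst case allowed by the one-step bound), first with $s = 1$ for $t_0$ steps and then with $s = 2n/(2n + b(t - t_0))$, and compares each iterate with $2G/(\lambda(2n + b(t - t_0)))$. The function name and the parameter values in the usage note are illustrative assumptions, not part of the analysis.

```python
import math


def worst_case_ratio(n, b, lam, G, eps0, steps):
    """Iterate the one-step recursion with equality and compare with the O(1/t) bound.

    Returns (t0, ratio), where ratio is the largest observed value of
    eps_t / (2G / (lam * (2n + b * (t - t0)))) for t = t0, ..., t0 + steps.
    """
    if not (1 <= b <= n and lam > 0 and G > 0 and eps0 > 0):
        raise ValueError("need 1 <= b <= n and positive lam, G, eps0")

    def step(eps, s):
        # One application of the recursion, taken with equality.
        return (1 - b * s / n) * eps + b * (s / n) ** 2 * G / (2 * lam)

    # Warm-up phase with s = 1, of length t0 as chosen in the proof.
    t0 = max(0, math.ceil((n / b) * math.log(2 * lam * n * eps0 / G)))
    eps = eps0
    for _ in range(t0):
        eps = step(eps, 1.0)

    ratio = 0.0
    for t in range(t0, t0 + steps + 1):
        width = 2 * n + b * (t - t0)
        ratio = max(ratio, eps / (2 * G / (lam * width)))
        eps = step(eps, 2 * n / width)  # step size used in the induction
    return t0, ratio
```

For example, `worst_case_ratio(100, 10, 0.01, 1.0, 1.0, 500)` should return a ratio of at most 1, in line with the induction argument.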
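If $\bar{\alpha}$ is defined as the average $\frac{1}{T - T_0}\sum_{t=T_0}^{T-1}\alpha^{(t)}$ then we obtain that

$$\begin{aligned}
\mathbf{E}[\mathcal{G}(\bar{\alpha})]
&= \mathbf{E}\Big[\mathcal{G}\Big(\sum_{t=T_0}^{T-1}\frac{1}{T - T_0}\alpha^{(t)}\Big)\Big]
\le \frac{1}{T - T_0}\mathbf{E}\Big[\sum_{t=T_0}^{T-1}\mathcal{G}(\alpha^{(t)})\Big] \\
&\overset{(15)}{\le} \frac{n}{sb}\,\frac{1}{T - T_0}\,\mathbf{E}\Big[\sum_{t=T_0}^{T-1}\big(\mathcal{D}(\alpha^{(t+1)}) - \mathcal{D}(\alpha^{(t)})\big)\Big] + \frac{s}{n}\frac{G}{2\lambda} \\
&= \frac{n}{sb}\,\frac{1}{T - T_0}\big(\mathbf{E}[\mathcal{D}(\alpha^{(T)})] - \mathbf{E}[\mathcal{D}(\alpha^{(T_0)})]\big) + \frac{sG}{2\lambda n} \\
&\le \frac{n}{sb}\,\frac{1}{T - T_0}\big(\mathcal{D}(\alpha^*) - \mathbf{E}[\mathcal{D}(\alpha^{(T_0)})]\big) + \frac{sG}{2\lambda n}.
\end{aligned}$$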

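Now, if $T \ge \lceil \frac{n}{b} \rceil + T_0$ such that $T_0 \ge t_0$ we obtain

$$\mathbf{E}[\mathcal{G}(\bar{\alpha})] \overset{(25)}{\le} \frac{n}{sb}\,\frac{1}{T - T_0}\Big(\frac{2G}{\lambda(2n + b(T_0 - t_0))}\Big) + \frac{sG}{2\lambda n}
= \frac{G}{\lambda}\Big(\frac{n}{sb(T - T_0)}\cdot\frac{2}{2n + b(T_0 - t_0)} + \frac{s}{2n}\Big). \tag{27}$$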
Using $s = \frac{n}{b(T - T_0)}$ we obtain that

$$\mathbf{E}[\mathcal{G}(\bar{\alpha})] \le \frac{G}{\lambda}\Big(\frac{2}{2n + b(T_0 - t_0)} + \frac{1}{2b(T - T_0)}\Big).$$
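To make the substitution concrete, the following Python sketch evaluates the bound on the expected duality gap both for a general $s \in (0, 1]$ and for the particular choice $s = n/(b(T - T_0))$, under the assumption that the general bound has the form $\frac{G}{\lambda}\big(\frac{n}{sb(T - T_0)}\cdot\frac{2}{2n + b(T_0 - t_0)} + \frac{s}{2n}\big)$. The function names and the example values are hypothetical.

```python
import math


def gap_bound_general(n, b, lam, G, t0, T0, T, s):
    """Bound on the expected duality gap of the averaged iterate for a given s in (0, 1]."""
    if not (0 < s <= 1 and T > T0 >= t0):
        raise ValueError("need 0 < s <= 1 and T > T0 >= t0")
    first = n / (s * b * (T - T0)) * 2 / (2 * n + b * (T0 - t0))
    return (G / lam) * (first + s / (2 * n))


def gap_bound(n, b, lam, G, t0, T0, T):
    """The same bound after plugging in s = n / (b (T - T0))."""
    if T - T0 < n / b or T0 < t0:
        # T - T0 >= n / b guarantees that the chosen s is at most 1.
        raise ValueError("need T - T0 >= n / b and T0 >= t0")
    return (G / lam) * (2 / (2 * n + b * (T0 - t0)) + 1 / (2 * b * (T - T0)))
```

With $n = 100$, $b = 10$, $\lambda = 0.01$, $G = 1$, $t_0 = 7$, $T_0 = 107$ and $T = 207$, the chosen step is $s = 0.1$ and both functions agree, as expected from the algebra.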

To have this quantity < eg we obtain that T, to,To has to satisfy ( ITOb - The fact that To > to + 
| — 2nj implies that right-hand site of (l26l > is < eg. 


B.3 Proof of Theorem |3 


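If the function $\phi_i$ is $(1/\gamma)$-smooth then $\phi_i^*$ is $\gamma$-strongly convex. If we plug $s = \frac{\lambda n \gamma}{\|v\|_\infty + \lambda n \gamma} \in (0, 1)$ into (16) we obtain that $\forall t : G^{(t)} \le 0$. Hence (15) will read as follows

$$\mathbf{E}[\mathcal{D}(\alpha^{(t+1)}) - \mathcal{D}(\alpha^{(t)})] \ge b\frac{s}{n}\mathcal{G}(\alpha^{(t)}) = b\frac{\lambda\gamma}{\|v\|_\infty + \lambda n \gamma}\mathcal{G}(\alpha^{(t)}) \ge b\frac{\lambda\gamma}{\|v\|_\infty + \lambda n \gamma}\big(\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)})\big). \tag{28}$$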

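Using the fact that $\mathbf{E}[\mathcal{D}(\alpha^{(t+1)}) - \mathcal{D}(\alpha^{(t)})] = \mathbf{E}[\mathcal{D}(\alpha^{(t+1)}) - \mathcal{D}(\alpha^*)] + \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)})$ we have

$$\mathbf{E}[\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t+1)})] \le \Big(1 - b\frac{\lambda\gamma}{\|v\|_\infty + \lambda n \gamma}\Big)\big(\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)})\big). \tag{29}$$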

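Therefore if we denote $\epsilon_{\mathcal{D}}^{(t)} = \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)})$ we have that

$$\mathbf{E}[\epsilon_{\mathcal{D}}^{(t)}] \le \Big(1 - b\frac{\lambda\gamma}{\|v\|_\infty + \lambda n \gamma}\Big)^t \epsilon_{\mathcal{D}}^{(0)} \overset{(13)}{\le} \Big(1 - b\frac{\lambda\gamma}{\|v\|_\infty + \lambda n \gamma}\Big)^t \le \exp\Big(-bt\frac{\lambda\gamma}{\|v\|_\infty + \lambda n \gamma}\Big).$$

The right-hand side will be smaller than some $\epsilon_{\mathcal{D}}$ if

$$t \ge \frac{1}{b}\Big(n + \frac{\|v\|_\infty}{\lambda\gamma}\Big)\log\frac{1}{\epsilon_{\mathcal{D}}}.$$

Moreover, to bound the duality gap we have

$$b\frac{\lambda\gamma}{\|v\|_\infty + \lambda n \gamma}\mathcal{G}(\alpha^{(t)}) \overset{(28)}{\le} \mathbf{E}[\epsilon_{\mathcal{D}}^{(t)} - \epsilon_{\mathcal{D}}^{(t+1)}] \le \epsilon_{\mathcal{D}}^{(t)}. \tag{30}$$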


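Therefore $\mathcal{G}(\alpha^{(t)}) \le \frac{\|v\|_\infty + \lambda n \gamma}{b\lambda\gamma}\epsilon_{\mathcal{D}}^{(t)}$. Hence if $\epsilon_{\mathcal{D}} \le \frac{b\lambda\gamma}{\|v\|_\infty + \lambda n \gamma}\epsilon_{\mathcal{G}}$ then $\mathcal{G}(\alpha^{(t)}) \le \epsilon_{\mathcal{G}}$. Therefore after

$$t \ge \frac{1}{b}\Big(n + \frac{\|v\|_\infty}{\lambda\gamma}\Big)\log\Big(\frac{1}{b}\Big(n + \frac{\|v\|_\infty}{\lambda\gamma}\Big)\frac{1}{\epsilon_{\mathcal{G}}}\Big)$$

iterations we have duality gap less than $\epsilon_{\mathcal{G}}$ and the first part of the proof is done. To show the second part of the Theorem, let us sum (30) over $t = T_0, \dots, T - 1$ to obtain

$$\mathbf{E}\Big[\frac{1}{T - T_0}\sum_{t=T_0}^{T-1}\mathcal{G}(\alpha^{(t)})\Big] \le \frac{\|v\|_\infty + \lambda n \gamma}{b\lambda\gamma}\,\frac{1}{T - T_0}\,\mathbf{E}[\mathcal{D}(\alpha^{(T)}) - \mathcal{D}(\alpha^{(T_0)})]. \tag{31}$$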


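Now, if we choose $\bar{w}, \bar{\alpha}$ to be either average vectors or a randomly chosen vector over $t \in \{T_0 + 1, \dots, T\}$, then we have

$$\mathbf{E}[\mathcal{G}(\bar{\alpha})] \le \frac{\|v\|_\infty + \lambda n \gamma}{b\lambda\gamma}\,\frac{1}{T - T_0}\,\mathbf{E}[\mathcal{D}(\alpha^{(T)}) - \mathcal{D}(\alpha^{(T_0)})] \le \frac{\|v\|_\infty + \lambda n \gamma}{b\lambda\gamma}\,\frac{1}{T - T_0}\,\mathbf{E}[\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(T_0)})].$$

Hence to have $\mathbf{E}[\mathcal{G}(\bar{\alpha})] \le \epsilon_{\mathcal{G}}$ it is sufficient to choose

$$\mathbf{E}[\epsilon_{\mathcal{D}}^{(T_0)}] \le \frac{b\lambda\gamma}{\|v\|_\infty + \lambda n \gamma}(T - T_0)\,\epsilon_{\mathcal{G}}.$$

Therefore we need $T_0$ to satisfy

$$T_0 \ge \frac{1}{b}\Big(n + \frac{\|v\|_\infty}{\lambda\gamma}\Big)\log\Big(\frac{1}{b}\Big(n + \frac{\|v\|_\infty}{\lambda\gamma}\Big)\frac{1}{(T - T_0)\,\epsilon_{\mathcal{G}}}\Big).$$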
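To get a high probability result we use Lemma 8 with $\xi_t = \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)})$, $c_2 = \frac{\|v\|_\infty + \lambda n \gamma}{b\lambda\gamma}$ (see (29)) and $\epsilon = \frac{b\lambda\gamma}{\|v\|_\infty + \lambda n \gamma}\epsilon_{\mathcal{G}}$ to obtain that after

$$T = c_2\log\Big(\frac{\xi_0}{\epsilon\rho}\Big) = \frac{\|v\|_\infty + \lambda n \gamma}{b\lambda\gamma}\log\Big(\frac{\xi_0}{\epsilon\rho}\Big)$$

iterations we have

$$1 - \rho \overset{\text{Lemma 8}}{\le} \mathbf{P}\big(\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(T)}) \le \epsilon\big) \overset{(30)}{\le} \mathbf{P}\Big(\frac{b\lambda\gamma}{\|v\|_\infty + \lambda n \gamma}\mathcal{G}(\alpha^{(T)}) \le \epsilon\Big) = \mathbf{P}\big(\mathcal{G}(\alpha^{(T)}) \le \epsilon_{\mathcal{G}}\big).$$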
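The iteration counts for smooth losses derived in this proof can be evaluated with a short Python sketch. It assumes the constant $c = \frac{\|v\|_\infty + \lambda n \gamma}{b\lambda\gamma} = \frac{1}{b}\big(n + \frac{\|v\|_\infty}{\lambda\gamma}\big)$, the expected-gap count $t \ge c\log(c/\epsilon_{\mathcal{G}})$, and the high probability count $T = c\log(\xi_0/(\epsilon\rho))$ with $\epsilon = \epsilon_{\mathcal{G}}/c$. The function names and the argument `v_inf` (standing for $\|v\|_\infty$) are our own illustrative choices.

```python
import math


def rate_constant(n, b, lam, gamma, v_inf):
    """c = (||v||_inf + lam * n * gamma) / (b * lam * gamma)."""
    if not (1 <= b <= n and lam > 0 and gamma > 0 and v_inf > 0):
        raise ValueError("need 1 <= b <= n and positive lam, gamma, v_inf")
    return (v_inf + lam * n * gamma) / (b * lam * gamma)


def iterations_expected_gap(n, b, lam, gamma, v_inf, eps_gap):
    """Iterations after which the expected duality gap is at most eps_gap."""
    c = rate_constant(n, b, lam, gamma, v_inf)
    return math.ceil(c * math.log(c / eps_gap))


def iterations_high_probability(n, b, lam, gamma, v_inf, eps_gap, rho, xi0):
    """Iterations after which the gap is at most eps_gap with probability >= 1 - rho.

    xi0 is the initial dual suboptimality D(alpha*) - D(alpha^(0)).
    """
    c2 = rate_constant(n, b, lam, gamma, v_inf)
    eps = eps_gap / c2  # accuracy on the dual suboptimality
    return math.ceil(c2 * math.log(xi0 / (eps * rho)))
```

For instance, with $n = 1000$, $b = 10$, $\lambda = 0.01$, $\gamma = 1$ and $\|v\|_\infty = 1$ the constant is $c = 110$, so reaching $\epsilon_{\mathcal{G}} = 10^{-3}$ in expectation takes on the order of $1.3 \times 10^3$ iterations. Both counts scale as $1/b$ for fixed $\|v\|_\infty$, which is where the mini-batch speedup enters; in general $\|v\|_\infty$ may itself depend on $b$ and on the data.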